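By establishing connections between neural networks and kernel methods, the infinite-width limit has shed light on the generalization and optimization aspects of deep learning. Despite their importance, the utility of these kernel methods has been limited in large-scale learning settings because of their (super-)quadratic runtime and memory complexity. Moreover, most prior work on neural kernels has focused on the ReLU activation, mainly because of its popularity but also because such kernels are difficult to compute for general activations. In this work, we overcome these difficulties by providing methods that work with general activations. First, we compile and extend the list of activation functions that admit exact dual activation expressions for computing neural kernels. When the exact computation is unknown, we present methods to approximate them efficiently. We propose a fast sketching method that approximates any multi-layer Neural Network Gaussian Process (NNGP) kernel and Neural Tangent Kernel (NTK) matrix for a wide range of activation functions, going beyond the commonly analyzed ReLU activation. This is done by showing how to approximate the neural kernels using the truncated Hermite expansion of any desired activation function. While most prior work requires the data points to lie on the unit sphere, our methods do not suffer from this limitation and apply to any dataset of points in $\mathbb{R}^d$. Furthermore, we provide a subspace embedding for NNGP and NTK matrices with near input-sparsity runtime and near-optimal target dimension, which applies to any \emph{homogeneous} dual activation function with a rapidly convergent Taylor expansion. Empirically, compared with exact convolutional NTK (CNTK) computation, our method achieves a $106\times$ speedup for the approximate CNTK of a 5-layer Myrtle network on the CIFAR-10 dataset.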
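As a rough illustration of the truncated Hermite expansion idea mentioned above, the following sketch approximates the dual activation $\mathbb{E}[\sigma(u)\sigma(v)]$ of a single layer for unit-variance Gaussian inputs with correlation $\rho$, using the identity $\sum_n a_n^2 \rho^n$ for coefficients $a_n$ in the orthonormal Hermite basis. It is not the paper's sketching algorithm: the function names, the grid-based quadrature, and the truncation degree are assumptions chosen for this example, and ReLU is used only because its exact dual activation is known in closed form for comparison.

```python
import math
import numpy as np
from numpy.polynomial.hermite_e import hermeval


def relu(x):
    return np.maximum(x, 0.0)


def hermite_coeffs(act, degree, half_width=12.0, num=200001):
    """Coefficients a_n = E[act(x) h_n(x)], x ~ N(0, 1), for n = 0..degree.

    h_n = He_n / sqrt(n!) are the orthonormal probabilists' Hermite
    polynomials. The expectation is approximated by a Riemann sum on a
    dense grid (an assumption of this sketch, not the paper's method).
    """
    x = np.linspace(-half_width, half_width, num)
    dx = x[1] - x[0]
    density = np.exp(-0.5 * x**2) / math.sqrt(2.0 * math.pi)
    values = act(x) * density
    coeffs = np.empty(degree + 1)
    for n in range(degree + 1):
        basis = np.zeros(n + 1)
        basis[n] = 1.0  # selects He_n in the Hermite-e series
        h_n = hermeval(x, basis) / math.sqrt(math.factorial(n))
        coeffs[n] = np.sum(values * h_n) * dx
    return coeffs


def truncated_dual(coeffs, rho):
    """Truncated dual activation: sum_n a_n^2 * rho^n."""
    rho = np.asarray(rho, dtype=float)
    powers = rho[..., None] ** np.arange(len(coeffs))
    return powers @ (coeffs**2)


def exact_relu_dual(rho):
    """Closed-form E[relu(u) relu(v)] for unit-variance u, v with correlation rho."""
    rho = np.asarray(rho, dtype=float)
    return (np.sqrt(1.0 - rho**2) + rho * (np.pi - np.arccos(rho))) / (2.0 * np.pi)
```

Calling `truncated_dual(hermite_coeffs(relu, 20), rho)` on a grid of correlations and comparing it with `exact_relu_dual(rho)` gives a feel for how quickly the truncation converges. Because ReLU has a kink, its Hermite coefficients decay slowly and the approximation is worst near $|\rho| = 1$; smoother activations converge faster.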
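Although learning in high dimensions is commonly believed to suffer from the curse of dimensionality, modern machine learning methods often show a surprising ability to solve a wide range of challenging real-world learning problems without using large amounts of data. How these methods break this curse remains a fundamental open question in the theory of deep learning. While previous efforts have investigated this question by studying the data (D), the model (M), and the inference algorithm (I) as independent modules, in this paper we analyze the triplet (D, M, I) as an integrated system and identify important synergies that help mitigate the curse of dimensionality. We first study the basic symmetries associated with various learning algorithms (M, I), focusing on four prototypical architectures in deep learning: fully-connected networks (FCN), locally-connected networks (LCN), and convolutional networks with and without pooling (GAP/VEC). We find that learning is most efficient when these symmetries are compatible with those of the data distribution, and that performance deteriorates significantly when any member of the (D, M, I) triplet is inconsistent or suboptimal.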
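Understanding the fundamental principles behind the massive success of neural networks is one of the most important open questions in deep learning. However, because of the highly complex nature of the problem, progress has been relatively slow. In this note, through the lens of infinite-width networks, a.k.a. neural kernels, we present one such principle, which results from hierarchical locality. It is well known that the eigenstructure of infinite-width multilayer perceptrons (MLPs) depends solely on the concept of frequency, which measures the order of interactions. We show that the topologies of deep convolutional networks (CNNs) restructure the associated eigenspaces into finer subspaces. In addition to frequency, the new structure also depends on the concept of space, which measures the spatial distance between nonlinear interaction terms. The resulting fine-grained eigenstructure dramatically improves the networks' learnability, enabling them to simultaneously model a much richer class of interactions, including long-range low-frequency interactions, short-range high-frequency interactions, and various interpolations and extrapolations in between. In addition, model scaling can improve the resolution of these interpolations and extrapolations and, therefore, the networks' learnability. Finally, we prove a characterization of the generalization error of infinite-width CNNs of any depth in the high-dimensional setting. Two corollaries follow: (1) infinite-width deep CNNs can break the curse of dimensionality without losing their expressivity, and (2) scaling improves performance in both the finite and infinite data regimes.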
A longstanding goal in deep learning research has been to precisely characterize training and generalization. However, the often complex loss landscapes of neural networks have made a theory of learning dynamics elusive. In this work, we show that for wide neural networks the learning dynamics simplify considerably and that, in the infinite width limit, they are governed by a linear model obtained from the first-order Taylor expansion of the network around its initial parameters. Furthermore, mirroring the correspondence between wide Bayesian neural networks and Gaussian processes, gradient-based training of wide neural networks with a squared loss produces test set predictions drawn from a Gaussian process with a particular compositional kernel. While these theoretical results are only exact in the infinite width limit, we nevertheless find excellent empirical agreement between the predictions of the original network and those of the linearized version even for finite practically-sized networks. This agreement is robust across different architectures, optimization methods, and loss functions.
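To make the first-order Taylor expansion concrete, here is a minimal sketch for a hypothetical two-layer tanh network $f(W, v; x) = v^\top \tanh(Wx)/\sqrt{m}$. The architecture, the scaling, and all function names are assumptions chosen for illustration and are not taken from the paper. The linearized model is $f_{\mathrm{lin}}(\theta) = f(\theta_0) + \nabla_\theta f(\theta_0)^\top (\theta - \theta_0)$, with gradients written out by hand.

```python
import numpy as np


def network(W, v, x):
    """Two-layer network v . tanh(W x) / sqrt(m), where m = len(v) is the width."""
    return v @ np.tanh(W @ x) / np.sqrt(len(v))


def network_grads(W, v, x):
    """Gradients of the network output with respect to W and v."""
    hidden = np.tanh(W @ x)
    scale = 1.0 / np.sqrt(len(v))
    grad_v = hidden * scale
    # d/dW_ij = v_i * (1 - tanh(z_i)^2) * x_j / sqrt(m)
    grad_W = np.outer(v * (1.0 - hidden**2), x) * scale
    return grad_W, grad_v


def linearized_network(W, v, x, W0, v0):
    """First-order Taylor expansion of the network around (W0, v0)."""
    grad_W, grad_v = network_grads(W0, v0, x)
    return (
        network(W0, v0, x)
        + np.sum(grad_W * (W - W0))
        + grad_v @ (v - v0)
    )
```

At the expansion point the two models agree exactly, and for parameters close to `(W0, v0)` the gap between `network` and `linearized_network` shrinks quadratically in the size of the perturbation. The abstract's claim is stronger: for wide networks, the parameters reached by training stay close enough to initialization that this linear model describes the learning dynamics.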
Weakly-supervised object localization aims to indicate the category as well as the scope of an object in an image given only image-level labels. Most existing works are based on Class Activation Mapping (CAM) and endeavor to enlarge the discriminative area inside the activation map to perceive the whole object, yet they ignore the co-occurrence confounder of the object and its context (e.g., fish and water), which makes it hard for model inspection to distinguish object boundaries. Besides, the use of CAM also brings a dilemma: classification and localization always suffer from a performance gap and cannot reach their highest accuracy simultaneously. In this paper, we propose a causal knowledge distillation method, dubbed KD-CI-CAM, to address these two under-explored issues in one go. More specifically, we tackle the co-occurrence context confounder problem via causal intervention (CI), which explores the causalities among image features, contexts, and categories to eliminate the biased object-context entanglement in the class activation maps. Based on the de-biased object feature, we additionally propose a multi-teacher causal distillation framework to balance the absorption of classification knowledge and localization knowledge during model training. Extensive experiments on several benchmarks demonstrate the effectiveness of KD-CI-CAM in learning clear object boundaries from confounding contexts and in addressing the dilemma between classification and localization performance.
An increasing number of public datasets have shown a marked clinical impact on assessing anatomical structures. However, each of the datasets is small, partially labeled, and rarely investigates severe tumor subjects. Moreover, current models are limited to segmenting specific organs/tumors and cannot be extended to novel domains and classes. To tackle these limitations, we introduce embeddings learned from Contrastive Language-Image Pre-training (CLIP) into segmentation models, yielding what we call the CLIP-Driven Universal Model. The Universal Model can better segment 25 organs and 6 types of tumors by exploiting the semantic relationship between abdominal structures. The model is developed from an assembly of 14 datasets with 3,410 CT scans and evaluated on 6,162 external CT scans from 3 datasets. We rank first on the public leaderboard of the Medical Segmentation Decathlon (MSD) and achieve state-of-the-art results on Beyond The Cranial Vault (BTCV). Compared with dataset-specific models, the Universal Model is computationally more efficient (6x faster), generalizes better to CT scans from varying sites, and shows stronger transfer learning performance on novel tasks. The design of the CLIP embedding enables the Universal Model to be easily extended to new classes without catastrophically forgetting the previously learned classes.
In this work, we tackle two vital tasks in automated driving systems: driver intent prediction and risk object identification from egocentric images. Mainly, we investigate the question: what would be good road scene-level representations for these two tasks? We contend that a scene-level representation must capture higher-level semantic and geometric representations of the traffic scene around the ego-vehicle as it performs actions on the way to its destination. To this end, we introduce the representation of semantic regions, which are areas that ego-vehicles visit while taking an afforded action (e.g., a left turn at a 4-way intersection). We propose to learn scene-level representations via a novel semantic region prediction task and an automatic semantic region labeling algorithm. Extensive evaluations are conducted on the HDD and nuScenes datasets, and the learned representations lead to state-of-the-art performance for driver intention prediction and risk object identification.
GPUs with newer architectures, such as the A100, are now equipped with multi-instance GPU (MIG) technology, which allows the GPU to be partitioned into multiple small, isolated instances. This technology gives users more flexibility to support both deep learning training and inference workloads, but using it efficiently can still be challenging. The vision of this paper is to provide a more comprehensive and practical benchmark study for MIG in order to eliminate the need for tedious manual benchmarking and tuning. To achieve this vision, the paper presents MIGPerf, an open-source tool that streamlines the benchmark study for MIG. Using MIGPerf, the authors conduct a series of experiments, including deep learning training and inference characterization on MIG, GPU sharing characterization, and framework compatibility with MIG. The results of these experiments provide new insights and guidance for users to employ MIG effectively, and lay the foundation for further research on the orchestration of hybrid training and inference workloads on MIGs. The code and results are released on https://github.com/MLSysOps/MIGProfiler. This work is still in progress and more results will be published soon.
There are multiple scales of abstraction from which we can describe the same image, depending on whether we are focusing on fine-grained details or a more global attribute of the image. In brain mapping, learning to automatically parse images to build representations of both small-scale features (e.g., the presence of cells or blood vessels) and global properties of an image (e.g., which brain region the image comes from) is a crucial and open challenge. However, most existing datasets and benchmarks for neuroanatomy consider only a single downstream task at a time. To bridge this gap, we introduce a new dataset, annotations, and multiple downstream tasks that provide diverse ways to read out information about brain structure and architecture from the same image. Our multi-task neuroimaging benchmark (MTNeuro) is built on volumetric, micrometer-resolution X-ray microtomography images spanning a large thalamocortical section of mouse brain, encompassing multiple cortical and subcortical regions. We generated a number of different prediction challenges and evaluated several supervised and self-supervised models for brain-region prediction and pixel-level semantic segmentation of microstructures. Our experiments not only highlight the rich heterogeneity of this dataset, but also provide insights into how self-supervised approaches can be used to learn representations that capture multiple attributes of a single image and perform well on a variety of downstream tasks. Datasets, code, and pre-trained baseline models are provided at: https://mtneuro.github.io/ .
Designing better deep networks and designing better reinforcement learning (RL) algorithms are both important for deep RL. This work focuses on the former. Previous methods build the network from several modules such as CNN, LSTM, and Attention. Recent methods combine the Transformer with these modules for better performance. However, training a network composed of mixed modules requires tedious optimization skills, making these methods inconvenient to use in practice. In this paper, we propose to design \emph{pure Transformer-based networks} for deep RL, aiming to provide off-the-shelf backbones for both the online and offline settings. Specifically, we propose the Transformer in Transformer (TIT) backbone, which cascades two Transformers in a very natural way: the inner one processes a single observation, while the outer one processes the observation history; combining the two is expected to extract spatial-temporal representations for good decision-making. Experiments show that TIT consistently achieves satisfactory performance across different settings.
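The inner/outer cascade described above can be sketched at the level of tensor shapes. The code below is a deliberately stripped-down illustration, not the TIT architecture itself: it uses a single attention head per stage, mean pooling over tokens, and no residual connections, layer normalization, feed-forward blocks, or positional embeddings. All of these simplifications, and the names `self_attention` and `cascaded_features`, are assumptions made for this example.

```python
import numpy as np


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over the second-to-last axis."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def cascaded_features(history, inner_params, outer_params):
    """Cascade two attention stages over an observation history.

    history: array of shape (T, P, D) holding T observations, each split
    into P tokens (e.g. patches) of dimension D.
    """
    # Inner stage: attend among the tokens of each observation independently,
    # then pool them into one vector per observation -> shape (T, D).
    per_observation = self_attention(history, *inner_params).mean(axis=1)
    # Outer stage: attend across the T per-observation vectors.
    across_time = self_attention(per_observation, *outer_params)
    # Use the representation at the latest time step for decision-making.
    return across_time[-1]
```

With `history` of shape `(4, 6, 8)` and three `(8, 8)` projection matrices per stage, `cascaded_features` returns a single vector of length 8 that a policy or value head could consume.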